import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
plt.style.use('seaborn-white')
plt.rc('figure', dpi=100, figsize=(7, 5))
plt.rc('font', size=12)
import warnings
warnings.simplefilter('ignore')
The Wisconsin breast cancer dataset (WBCD) is a commonly used dataset for demonstrating binary classification. It is built into sklearn.datasets.
from sklearn.datasets import load_breast_cancer
loaded = load_breast_cancer() # explore the value of `loaded`!
data = loaded['data']
# load_breast_cancer encodes 0 = malignant and 1 = benign; flip it so
# that 1 = malignant (our "positive" class) and 0 = benign.
labels = 1 - loaded['target']
cols = loaded['feature_names']
bc = pd.DataFrame(data, columns=cols)
bc.head()
| | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst radius | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17.99 | 10.38 | 122.80 | 1001.0 | 0.11840 | 0.27760 | 0.3001 | 0.14710 | 0.2419 | 0.07871 | ... | 25.38 | 17.33 | 184.60 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.11890 |
| 1 | 20.57 | 17.77 | 132.90 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | 0.05667 | ... | 24.99 | 23.41 | 158.80 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.1860 | 0.2750 | 0.08902 |
| 2 | 19.69 | 21.25 | 130.00 | 1203.0 | 0.10960 | 0.15990 | 0.1974 | 0.12790 | 0.2069 | 0.05999 | ... | 23.57 | 25.53 | 152.50 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.2430 | 0.3613 | 0.08758 |
| 3 | 11.42 | 20.38 | 77.58 | 386.1 | 0.14250 | 0.28390 | 0.2414 | 0.10520 | 0.2597 | 0.09744 | ... | 14.91 | 26.50 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.17300 |
| 4 | 20.29 | 14.34 | 135.10 | 1297.0 | 0.10030 | 0.13280 | 0.1980 | 0.10430 | 0.1809 | 0.05883 | ... | 22.54 | 16.67 | 152.20 | 1575.0 | 0.1374 | 0.2050 | 0.4000 | 0.1625 | 0.2364 | 0.07678 |
5 rows × 30 columns
1 stands for "malignant", i.e. cancerous, and 0 stands for "benign", i.e. safe.
labels[:5]
array([1, 1, 1, 1, 1])
pd.Series(labels).value_counts(normalize=True)
0    0.627417
1    0.372583
dtype: float64
Our goal is to use the features in bc to predict labels.
Logistic regression is a linear classification technique that builds upon linear regression. It models the probability of belonging to class 1, given a feature vector:
$$P(y = 1 \mid \vec{x}) = \sigma (\underbrace{w_0 + w_1 x^{(1)} + w_2 x^{(2)} + \dots + w_d x^{(d)}}_{\text{linear regression model}})$$

Here, $\sigma(t) = \frac{1}{1 + e^{-t}}$ is the sigmoid function; its outputs are between 0 and 1 (which means they can be interpreted as probabilities).
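To make the sigmoid concrete, here's a minimal sketch in numpy (reusing the np alias imported above):

def sigmoid(t):
    # Maps any real number to the interval (0, 1).
    return 1 / (1 + np.exp(-t))

sigmoid(np.array([-5, 0, 5]))  # roughly [0.0067, 0.5, 0.9933]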
🤔 Question: Suppose our logistic regression model predicts the probability that a tumor is malignant is 0.75. What class do we predict – malignant or benign? What if the predicted probability is 0.3?
🙋 Answer: We have to pick a threshold (e.g. 0.5)!
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
X_train, X_test, y_train, y_test = train_test_split(bc, labels)
clf = LogisticRegression()
clf.fit(X_train, y_train)
LogisticRegression()
How did clf come up with 1s and 0s?
clf.predict(X_test)
array([1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1,
0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0,
0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0,
1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0,
0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0,
0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0,
1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1])
It turns out that the predicted labels come from applying a threshold of 0.5 to the predicted probabilities. We can access the predicted probabilities via the predict_proba method:
# [:, 1] refers to the predicted probabilities for class 1
clf.predict_proba(X_test)[:, 1]
array([9.99962449e-01, 2.00820261e-01, 3.07062878e-03, 8.43082162e-01,
1.82204801e-03, 1.00000000e+00, 1.59756542e-02, 2.36453802e-02,
2.53007559e-03, 8.61441072e-02, 2.88143693e-03, 3.23793755e-02,
9.95776303e-01, 2.77629531e-03, 2.70357622e-03, 1.67298404e-01,
9.85035355e-04, 1.00000000e+00, 1.49486589e-01, 6.17862722e-02,
9.99999972e-01, 9.99101366e-01, 7.64809824e-02, 4.13729381e-03,
3.67511833e-03, 8.99509546e-01, 9.99782713e-01, 1.46093074e-02,
1.85438932e-02, 1.84931603e-02, 8.69319967e-03, 4.85725190e-03,
2.56943792e-02, 1.00000000e+00, 1.47699301e-03, 9.99999986e-01,
7.38597732e-03, 9.99999998e-01, 3.65161851e-03, 9.51064280e-01,
9.99995047e-01, 1.38597304e-03, 2.14541810e-01, 5.05984762e-03,
8.96614097e-03, 9.84118653e-01, 3.94578279e-01, 1.00000000e+00,
1.13836605e-02, 1.23516629e-03, 5.19645168e-02, 9.99937322e-01,
6.92347699e-01, 2.33421162e-02, 2.41327903e-02, 9.99998841e-01,
1.76103242e-02, 5.54019707e-03, 9.99999998e-01, 3.04249473e-02,
9.99966114e-01, 2.67399392e-01, 9.98769576e-01, 9.23543864e-04,
4.13741972e-03, 1.99201446e-02, 9.99798729e-01, 2.67156556e-03,
1.13636300e-02, 1.00000000e+00, 4.72249512e-03, 5.30053891e-03,
4.19381458e-03, 9.99999940e-01, 9.99566821e-01, 7.06258480e-01,
3.32427690e-04, 7.61376563e-01, 9.97800686e-01, 9.69900997e-01,
9.99997823e-01, 3.86713912e-03, 4.19193535e-04, 9.63501213e-01,
9.98950746e-01, 1.25353669e-02, 8.25718137e-01, 6.61668154e-02,
7.86267546e-04, 2.00025937e-02, 6.72853558e-01, 3.95927606e-01,
5.40187650e-03, 9.99998750e-01, 1.15475440e-02, 1.51972741e-02,
1.15945715e-01, 9.99260561e-01, 4.16849121e-03, 6.04594992e-04,
9.96611474e-01, 6.56395053e-01, 1.21147186e-01, 2.41182003e-01,
9.99998093e-01, 8.08703708e-02, 1.09079755e-03, 9.99999929e-01,
1.32816124e-02, 4.97925708e-04, 2.30336353e-03, 9.99999953e-01,
9.99086637e-01, 3.21680214e-02, 9.69472435e-01, 1.27230520e-03,
2.25041854e-03, 1.00000000e+00, 9.99999940e-01, 9.99916637e-01,
4.03378095e-03, 1.08021099e-01, 3.07095414e-04, 9.99997656e-01,
9.99999989e-01, 1.00000000e+00, 6.22245527e-04, 8.31936385e-02,
4.54887058e-03, 9.99999994e-01, 2.09695481e-05, 9.21071126e-04,
9.99998400e-01, 8.22050636e-03, 1.00000000e+00, 1.00000000e+00,
2.35015086e-03, 6.82419126e-02, 9.99369521e-01, 9.99999706e-01,
9.79530846e-01, 5.03018816e-02, 9.99999379e-01])
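As a sanity check (a sketch, assuming no probability lands exactly on the 0.5 boundary), thresholding these probabilities at 0.5 should reproduce the labels that clf.predict returned above:

# Manual thresholding at 0.5 vs. clf.predict's labels.
probs = clf.predict_proba(X_test)[:, 1]
np.all((probs >= 0.5) == clf.predict(X_test))  # expect True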
Note that our model still has $w^*$s:
clf.intercept_
array([-0.26842224])
clf.coef_
array([[-1.49816748, -0.44554582, 0.02563369, 0.00416895, 0.06255073,
0.27568819, 0.37596657, 0.1645085 , 0.0951418 , 0.01712652,
-0.04233645, -0.67642204, -0.21986992, 0.07577716, 0.00676443,
0.05156208, 0.07422026, 0.02254724, 0.02055806, 0.00553588,
-1.53680226, 0.4769345 , 0.16351417, 0.02198396, 0.10805671,
0.80264563, 0.97024024, 0.3097782 , 0.24980996, 0.0759066 ]])
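To tie these weights back to the formula above, here's a sketch that recomputes the model's predicted probability for the first test example by hand; the result should agree with predict_proba:

# sigma(w_0 + w · x) for the first test example, computed manually.
t = clf.intercept_[0] + X_test.iloc[0] @ clf.coef_[0]
1 / (1 + np.exp(-t)), clf.predict_proba(X_test.iloc[[0]])[0, 1]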
Let's see how well our model does on the test set.
from sklearn import metrics
y_pred = clf.predict(X_test)
metrics.accuracy_score(y_test, y_pred)
0.965034965034965
metrics.precision_score(y_test, y_pred)
0.9827586206896551
metrics.recall_score(y_test, y_pred)
0.9344262295081968
🤔 Question: Which metric is more important for this task – precision or recall?

🙋 Answer: Arguably recall – a false negative here means a malignant tumor goes undetected, which is far more dangerous than a false alarm.
metrics.confusion_matrix(y_test, y_pred)
array([[81, 1],
[ 4, 57]])
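The precision and recall above can be read directly off this matrix. A short sketch, using sklearn's convention that rows are true labels and columns are predicted labels:

# Unpack the 2x2 confusion matrix into its four counts.
(tn, fp), (fn, tp) = metrics.confusion_matrix(y_test, y_pred)
tp / (tp + fp), tp / (tp + fn)  # precision, recall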
# ConfusionMatrixDisplay.from_estimator replaces the deprecated plot_confusion_matrix.
metrics.ConfusionMatrixDisplay.from_estimator(clf, X_test, y_test);
🤔 Question: Suppose we choose a threshold higher than 0.5. What will happen to our model's precision and recall?
🙋 Answer: Precision will typically increase, while recall will decrease.
Similarly, if we decrease our threshold, our model's precision will decrease, while its recall will increase.
The classification threshold is not actually a hyperparameter of LogisticRegression, because the threshold doesn't change the coefficients ($w^*$s) of the logistic regression model itself (see this article for more details).
As such, if we want to imagine how our predicted classes would change with thresholds other than 0.5, we need to manually threshold.
thresholds = np.arange(0, 1.01, 0.01)
precisions = np.array([])
recalls = np.array([])
for t in thresholds:
    y_pred = clf.predict_proba(X_test)[:, 1] >= t
    precisions = np.append(precisions, metrics.precision_score(y_test, y_pred))
    recalls = np.append(recalls, metrics.recall_score(y_test, y_pred))
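(As an aside, sklearn can compute this tradeoff in one call. The sketch below uses metrics.precision_recall_curve, which evaluates precision and recall at every distinct predicted probability instead of on a fixed grid.)

# Built-in alternative to the manual loop above.
prs, res, ts = metrics.precision_recall_curve(
    y_test, clf.predict_proba(X_test)[:, 1]
)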
Let's visualize the results with plotly, which produces interactive plots.
px.line(x=thresholds, y=precisions,
labels={'x': 'Threshold', 'y': 'Precision'}, title='Precision vs. Threshold', width=1000, height=600)
px.line(x=thresholds, y=recalls,
labels={'x': 'Threshold', 'y': 'Recall'}, title='Recall vs. Threshold', width=1000, height=600)
px.line(x=recalls, y=precisions, hover_name=thresholds,
labels={'x': 'Recall', 'y': 'Precision'}, title='Precision vs. Recall')
The above curve is called a precision-recall (or PR) curve.
🤔 Question: Based on the PR curve above, what threshold would you choose?
If we care equally about a model's precision $PR$ and recall $RE$, we can combine the two using a single metric called the F1-score:
$$\text{F1-score} = \text{harmonic mean}(PR, RE) = 2\frac{PR \cdot RE}{PR + RE}$$

pr = metrics.precision_score(y_test, clf.predict(X_test))
re = metrics.recall_score(y_test, clf.predict(X_test))
2 * pr * re / (pr + re)
0.9579831932773109
metrics.f1_score(y_test, clf.predict(X_test))
0.9579831932773109
Both F1-score and accuracy are overall measures of a binary classifier's performance. But remember, accuracy is misleading in the presence of class imbalance, and doesn't take into account the kinds of errors the classifier makes.
metrics.accuracy_score(y_test, clf.predict(X_test))
0.965034965034965
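To see why, consider this sketch of a degenerate classifier that always predicts "benign":

# Always predicting the majority class (0, benign) still looks decent by
# accuracy, purely because of the class imbalance, but its recall is 0:
# it misses every single malignant tumor.
always_benign = np.zeros_like(y_test)
metrics.accuracy_score(y_test, always_benign), metrics.recall_score(y_test, always_benign)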
We just scratched the surface! This excellent table from Wikipedia summarizes the many other metrics that exist.

If you're interested in exploring further, a good next metric to look at is true negative rate (i.e. specificity), which is the analogue of recall for true negatives.
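Here's a minimal sketch of computing it from the confusion matrix:

# Specificity (TNR) = TN / (TN + FP): of the truly benign tumors, the
# fraction we correctly labeled as benign.
(tn, fp), (fn, tp) = metrics.confusion_matrix(y_test, clf.predict(X_test))
tn / (tn + fp)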
COMPAS (Correctional Offender Management Profiling for Alternative Sanctions) is a "black-box" model that estimates the likelihood that someone who has committed a crime will recidivate (commit another crime).

ProPublica found that the model's false positive rate is higher for African-Americans than it is for White Americans, and that its false negative rate is lower for African-Americans than it is for White Americans.

Note:
$$PPV = \text{precision} = \frac{TP}{TP+FP}, \qquad TPR = \text{recall} = \frac{TP}{TP + FN}, \qquad FPR = \frac{FP}{FP+TN}$$

Remember, our models learn patterns from the training data. Various sources of bias may be present within training data:
A 2015 study examined image search queries for various occupations and the gender makeup of the search results. Google Images' behavior has improved since 2015.
In 2015, a Google Images search for "nurse" returned...

Search for "nurse" now, what do you see?
In 2015, a Google Images search for "doctor" returned...

Search for "doctor" now, what do you see?

Excerpts:

"male-dominated professions tend to have even more men in their results than would be expected if the proportions reflected real-world distributions."

"[the study found that] people's existing perceptions of gender ratios in occupations are quite accurate, but that manipulated search results have an effect on perceptions."
LendingClub is a "peer-to-peer lending company"; they used to publish a dataset describing the loans that they approved (fortunately, we downloaded it while it was available).
- 'tag': whether the loan was repaid in full (1.0) or defaulted (0.0)
- 'loan_amnt': amount of the loan in dollars
- 'emp_length': number of years employed
- 'home_ownership': whether the borrower owns (1.0) or rents (0.0)
- 'inq_last_6mths': number of credit inquiries in the last six months
- 'revol_bal': revolving balance on the borrower's accounts
- 'age': age in years of the borrower (a protected attribute)

loans = pd.read_csv('data/loan_vars1.csv', index_col=0)
loans.head()
The total amount of money loaned was over 5 billion dollars!
loans['loan_amnt'].sum()
# Number of loans in the dataset.
loans.shape[0]
Let's build a classifier that predicts whether or not a loan was paid in full. If we were a bank, we could use our trained classifier to determine whether to approve someone for a loan!
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
X = loans.drop('tag', axis=1)
y = loans.tag
X_train, X_test, y_train, y_test = train_test_split(X, y)
clf = RandomForestClassifier(n_estimators=50)
clf.fit(X_train, y_train)
Recall that a prediction of 1 means we predict the loan will be paid in full.
y_pred = clf.predict(X_test)
y_pred
# score returns the classifier's accuracy on the test set.
clf.score(X_test, y_test)
from sklearn import metrics
metrics.ConfusionMatrixDisplay.from_estimator(clf, X_test, y_test);
Precision describes the proportion of loans that were approved that would have been paid back.
metrics.precision_score(y_test, y_pred)
If we subtract the precision from 1, we get the proportion of loans that were approved that would not have been paid back. This is known as the false discovery rate.
$$\frac{FP}{TP + FP} = 1 - \text{precision}$$

1 - metrics.precision_score(y_test, y_pred)
Recall describes the proportion of loans that would have been paid back that were actually approved.
metrics.recall_score(y_test, y_pred)
If we subtract the recall from 1, we get the proportion of loans that would have been paid back that were denied. This is known as the false negative rate.
$$\frac{FN}{TP + FN} = 1 - \text{recall}$$

1 - metrics.recall_score(y_test, y_pred)
From both the perspective of the bank and the lendee, a high false negative rate is bad!
# Copy so that adding columns doesn't mutate X_test itself.
results = X_test.copy()
# Bucket ages into 5-year brackets (e.g. 23 -> 25, 27 -> 30).
results['age_bracket'] = results['age'].apply(lambda x: 5 * (x // 5 + 1))
results['prediction'] = y_pred
results['tag'] = y_test
(
results
.groupby('age_bracket')
.apply(lambda x: 1 - metrics.recall_score(x['tag'], x['prediction']))
.plot(kind='bar', title='False Negative Rate by Age Group')
);
results['is_young'] = (results.age < 25).replace({True: 'young', False: 'old'})
First, let's compute the proportion of loans that were approved in each group. If these two numbers are the same, our classifier $C$ achieves demographic parity.
results.groupby('is_young')['prediction'].mean().to_frame()
$C$ evidently does not achieve demographic parity – older people are approved for loans far more often! Note that this doesn't factor in whether they were correctly approved or incorrectly approved.
Now, let's compute the accuracy of $C$ in each group. If these two numbers are the same, $C$ achieves accuracy parity.
(
results
.groupby('is_young')
.apply(lambda x: metrics.accuracy_score(x['tag'], x['prediction']))
.rename('accuracy')
.to_frame()
)
Hmm... These numbers look much more similar than before!
Let's run a permutation test to see if the difference in accuracy is significant.
obs = results.groupby('is_young').apply(lambda x: metrics.accuracy_score(x['tag'], x['prediction'])).diff().iloc[-1]
obs
diff_in_acc = []
for _ in range(100):
    s = (
        results[['is_young', 'prediction', 'tag']]
        # Shuffle the group labels; .to_numpy() avoids index alignment issues.
        .assign(is_young=results['is_young'].sample(frac=1.0).to_numpy())
        .groupby('is_young')
        .apply(lambda x: metrics.accuracy_score(x['tag'], x['prediction']))
        .diff()
        .iloc[-1]
    )
    diff_in_acc.append(s)
plt.figure(figsize=(10, 5))
pd.Series(diff_in_acc).plot(kind='hist', ec='w', density=True, bins=15, title='Difference in Accuracy (Young - Old)')
plt.axvline(x=obs, color='red', label='observed difference in accuracy')
plt.legend(loc='upper left');
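To make "significant" precise, we can attach an empirical p-value: the fraction of shuffled differences at least as extreme as the observed one (a sketch reusing obs and diff_in_acc from above).

# Two-sided empirical p-value for the permutation test.
(np.abs(np.array(diff_in_acc)) >= np.abs(obs)).mean()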
It seems like the difference in accuracy across the two groups is significant, despite being only ~6%. Thus, $C$ likely does not achieve accuracy parity.
Not only should we avoid using 'age' to determine whether or not to approve a loan, but we also shouldn't use other features that are strongly correlated with 'age', like 'emp_length'.
loans
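As a quick sketch of why 'emp_length' is a concern, we can check how strongly it correlates with the protected attribute:

# A strong correlation here means 'emp_length' leaks age information
# into the model, even if 'age' itself is dropped.
loans[['age', 'emp_length']].corr()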

In this course, you...
Now, you...
We learned a lot this quarter, including:

- pandas
- scikit-learn

This course would not have been possible without:
Our course staff: Shyam Renjith, Nicole Brye, Costin Smilovici, Yuxin Guo, Lucas Lee, Yunfan Long, Angela Wang, Zhaoyi Yu
Don't be a stranger!
Apply to be a tutor in the future! Learn more here.